
Fix ARM64 Virtual Thread currentCarrierThread Intrinsic Bug#25

Open
macarte wants to merge 14 commits into macarte/rebased-win-arrch64-fixes from macarte/rebased-additional-fixes

Conversation

@macarte
Contributor

@macarte macarte commented Feb 23, 2026

Windows AArch64 Fixes — PR Packaging Plan

The 11 commits (combined, with internal reverts resolved) touch 17 files and decompose into 6 logically distinct PRs.


PR 1: Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime

Files:

  • make/autoconf/flags-cflags.m4 — add /volatile:ms to JVM_CFLAGS for AArch64
  • src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp — PlatformOrderedLoad/PlatformOrderedStore using LDAR/STLR intrinsics
  • src/hotspot/os_cpu/windows_aarch64/orderAccess_windows_aarch64.hpp — READ_MEM_BARRIER/WRITE_MEM_BARRIER/FULL_MEM_BARRIER using std::atomic_thread_fence

Rationale: MSVC's /volatile:iso (default on ARM64) means volatile reads/writes are plain LDR/STR with no
acquire/release semantics, breaking HotSpot's assumption of volatile ≈ acquire/release throughout the C++ runtime. The /volatile:ms flag restores those semantics for all JVM code. With /volatile:ms active, std::atomic_thread_fence() in orderAccess provides correct compiler and hardware ordering. The PlatformOrderedLoad/PlatformOrderedStore specializations provide single-instruction LDAR/STLR for Atomic::load_acquire()/Atomic::release_store(), avoiding the redundant dmb that the generic fallback would emit.

Tests unblocked: Broad set of intermittent failures across ObjectMonitor, ParkEvent, lock-free algorithms in the
HotSpot runtime.
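
The acquire/release contract that /volatile:ms restores for C++ volatile is the same one Java spells out explicitly through VarHandle access modes. A minimal Java sketch of that contract (illustrative code, not from the PR):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch (not HotSpot code): getAcquire corresponds to an LDAR-style load,
// setRelease to an STLR-style store — the semantics /volatile:ms gives
// C++ volatile accesses on Windows AArch64.
class AcquireRelease {
    static final VarHandle DATA, READY;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            DATA = l.findStaticVarHandle(AcquireRelease.class, "data", int.class);
            READY = l.findStaticVarHandle(AcquireRelease.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    static int data;
    static boolean ready;

    static void publish(int v) {
        DATA.set(v);              // plain store
        READY.setRelease(true);   // release store: prior stores visible before it
    }

    static int consume() {
        while (!(boolean) READY.getAcquire()) Thread.onSpinWait(); // acquire load
        return (int) DATA.get();  // guaranteed to see the published value
    }
}
```

On AArch64 these access modes typically compile to single LDAR/STLR instructions, the same forms the PlatformOrderedLoad/PlatformOrderedStore specializations provide on the C++ side.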


PR 2: ObjectMonitor Dekker protocol StoreLoad barriers in try_spin()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — add OrderAccess::storeload() after set_successor(current) in both places in try_spin()

Rationale: On ARM64, the Dekker protocol between try_spin() (ST _succ → LD _owner) and exit() is broken without an explicit StoreLoad barrier. Even with PR 1's /volatile:ms (which gives acquire/release semantics), volatile store + volatile load to different addresses does not imply StoreLoad ordering on ARM64. This causes missed wakeups and thread starvation. This is a shared (not Windows-specific) fix for all ARM64 platforms.

Tests unblocked: Starvation test, various monitor contention hangs.

Note: Could be folded into PR 1 since it depends on the barrier infrastructure, but it's a clean logical unit on its own and touches shared code (not os_cpu).
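
A hedged Java sketch of the Dekker handshake described above (field and method names are illustrative, not the real ObjectMonitor members): each side stores to its own flag and then loads the other side's flag, and without a StoreLoad fence between the store and the load, ARM64 may let both loads observe the pre-store values.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of the try_spin()/exit() protocol, not HotSpot code.
class DekkerSketch {
    final AtomicReference<Thread> succ = new AtomicReference<>();   // ~ _succ
    final AtomicReference<Thread> owner = new AtomicReference<>();  // ~ _owner

    // Spinner side: advertise self as successor, then check whether the
    // owner released the lock in the meantime.
    boolean spinnerSeesFreeLock(Thread self) {
        succ.set(self);
        VarHandle.fullFence();        // StoreLoad: the barrier PR 2 adds
        return owner.get() == null;   // without it, may read a stale owner
    }

    // Exiting side: release the lock, then check whether a successor is
    // registered (if none is visible, no wakeup is issued — the hazard).
    boolean exiterSeesSuccessor() {
        owner.set(null);
        VarHandle.fullFence();        // StoreLoad on this side too
        return succ.get() != null;
    }
}
```

If both fences are omitted, both sides can take the "nothing to do" path: the spinner parks believing the lock is held, and the exiter skips the wakeup believing no successor exists.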


PR 3: ARM64 currentCarrierThread intrinsic — MO_ACQUIRE for OopHandle loads + @DontInline on Continuation.yield()

Files:

  • src/hotspot/share/c1/c1_GraphBuilder.cpp — don't inline @ChangesCurrentThread methods into non-@ChangesCurrentThread callers
  • src/hotspot/share/c1/c1_LIRGenerator.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in do_JavaThreadField
  • src/hotspot/share/gc/shared/c1/barrierSetC1.cpp — honor MO_ACQUIRE decorator with membar_acquire()
  • src/hotspot/share/opto/library_call.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in C2 current_thread_helper
  • src/java.base/share/classes/jdk/internal/vm/Continuation.java — @DontInline on Continuation.yield()

Rationale: On ARM64, after a virtual thread migrates between carriers, the OopHandle dereference for currentCarrierThread can be reordered with subsequent dependent loads (e.g., Thread.cont), observing stale data. The MO_ACQUIRE ensures ordering. The @DontInline on yield() and the @ChangesCurrentThread inlining guard prevent the compiler from inlining across carrier-change boundaries. This is a shared ARM64 fix (not Windows-specific).

Tests unblocked: Virtual thread scheduling failures, PingPong test.


PR 4: ForkJoinPool carrier starvation — tryCompensateForMonitor + CarrierThread.beginMonitorBlock()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — compensate_pinned_carrier() / end_compensate_pinned_carrier() + call sites in enter_with_contention_mark
  • src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java — tryCompensateForMonitor(), beginMonitorCompensatedBlock(), updated endCompensatedBlock(), FJP access registration
  • src/java.base/share/classes/jdk/internal/access/JavaUtilConcurrentFJPAccess.java — add beginMonitorCompensatedBlock to interface
  • src/java.base/share/classes/jdk/internal/misc/CarrierThread.java — beginMonitorBlock() + ForkJoinPools.beginMonitorCompensatedBlock()

Rationale: When a pinned virtual thread's carrier blocks on a contended monitor, the ForkJoinPool has no visibility that its worker is blocked. All carriers can end up pinned, preventing the VT holding the contested lock from ever getting a carrier → deadlock. This PR adds FJP compensation that activates an idle worker or creates a spare carrier. This is a correctness fix for virtual threads on all platforms.

Existing work: JDK-8345294 (carrier starvation when pinned VTs contend on monitors, 92e9ac6)

  • Tweaks deactivate() — widening the scan window from n + (n << 1) to n << 2, and fixing the reactivation CAS to use a fresh ctl value instead of the stale qc. This makes idle workers more responsive at detecting newly-queued tasks. It helps scenarios where the pool already has idle workers but they're slow to notice new work.

We still have the fundamental problem where all carriers are pinned and blocked on monitors with no idle workers left. In the Starvation.java test scenario:

  1. One VT holds a lock via synchronized(lock)
  2. nprocs - 1 other VTs are pinned (VThreadPinner.runPinned) and block on the same lock
  3. Each blocked carrier still counts as "active" in ForkJoinPool.ctl (RC field)
  4. When the lock holder unmounts, signalWork() sees no under-active workers → no carrier is available to run the unmounted VT that holds the lock → deadlock

JDK-8345294's fix helps because wider scanning makes it less likely that idle workers miss the signal, but it only works if there are idle workers. When every carrier is pinned and blocked, scanning changes nothing.

This compensation approach solves the problem directly:

Before a carrier parks in enter_internal(), it calls tryCompensateForMonitor() which either:

  • Branch 1: Activates an idle worker (like JDK-8345294's fix, but proactively)
  • Branch 3: Creates a new spare carrier and decrements RC, making the blocked carrier look "inactive" so signalWork() can dispatch the VT holding the lock to the new spare
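
A hypothetical simplification of the two branches in Java (method and field names are illustrative, not the real ForkJoinPool internals):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the tryCompensateForMonitor() decision. The key idea: before a
// pinned carrier blocks on a monitor, either wake an idle worker or create a
// spare and stop counting the blocked carrier as active, so signalWork() can
// still dispatch the virtual thread that holds the lock.
class CompensationSketch {
    final AtomicInteger idleWorkers = new AtomicInteger();
    final AtomicInteger activeCount;          // stands in for the RC field of ctl
    int spares;                               // spare carriers created

    CompensationSketch(int parallelism) {
        activeCount = new AtomicInteger(parallelism);
    }

    /** Called before the carrier parks on a contended monitor. */
    boolean tryCompensateForMonitor() {
        // Branch 1: an idle worker exists — activate it.
        if (idleWorkers.getAndUpdate(n -> n > 0 ? n - 1 : n) > 0)
            return true;
        // Branch 3: no idle workers — create a spare and drop our active count
        // so the pool no longer believes this blocked carrier can run work.
        spares++;
        activeCount.decrementAndGet();
        return true;
    }

    /** Called after the monitor is acquired: undo the compensation. */
    void endCompensate() {
        activeCount.incrementAndGet();
    }
}
```

In the Starvation.java scenario above, Branch 3 is what breaks the deadlock: the blocked carrier becomes "inactive" from the pool's point of view, and the new spare carrier picks up the unmounted lock holder.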

The existing timed-park heuristic in enter_internal() (exponential backoff from 1ms to 1000ms) is another mitigation that's already in upstream. It periodically wakes blocked carriers to retry, which can break deadlocks but:

  • Adds latency (up to 1 second in the worst case)
  • Doesn't actually free a carrier for the lock holder

Summary: Three layers of defense

| Layer | Purpose |
| --- | --- |
| Doug Lea's deactivate() fix | Helps signal propagation |
| Timed-park in enter_internal() | Breaks some deadlocks with latency cost |
| compensate_pinned_carrier() + tryCompensateForMonitor() | The actual fix that guarantees a replacement carrier exists |

Tests unblocked: Starvation test (fully), carrier deadlock scenarios.

Note: The introduced fences execute on every platform. However, the fences are correctness-critical on ARM64, functionally redundant on x86, and low-cost on the non-hot paths where they appear.

  • Without the fences, on ARM64 both sides can miss each other's writes: acquire doesn't see the state update, release doesn't see WAITING, and the thread is never unparked.

  • On x86, these are effectively redundant: the JIT already emits a StoreLoad barrier (lock addl) after volatile stores, and CAS instructions carry full-barrier semantics via the lock prefix. (TSO alone does not order a store against a later load to a different address; the redundancy comes from those existing barriers.) The fullFence() compiles to another lock addl but has negligible cost since release/releaseShared are not hot tight loops (they're called once per lock release).

Dependency: Depends on PR 2's includes (symbolTable.hpp, javaCalls.hpp) being in objectMonitor.cpp. Package them together or add the includes here.


PR 5: ARM64 Dekker-pattern StoreLoad fences in java.util.concurrent + VirtualThread

Files:

  • src/java.base/share/classes/java/lang/VirtualThread.java — U.fullFence() in afterYield (PARKING, BLOCKING, WAITING paths), afterDone, unpark, unblock, joinNanos
  • src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java — VarHandle.fullFence() after CAS in xfer
  • src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java — VarHandle.fullFence() after CAS in xferLifo
  • src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java — U.fullFence() in acquire (after setting WAITING), release, and releaseShared

Rationale: On ARM64, volatile write (STLR) + volatile read (LDAR) to different addresses does NOT provide StoreLoad ordering. All these sites have Dekker-like patterns where one side writes field A then reads field B, while the other writes B then reads A. Without an explicit StoreLoad fence, both sides can miss each other's stores. This is a shared ARM64 correctness fix that applies to all ARM64 JDKs (Linux too, though timing makes it less likely to manifest).
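
A hedged sketch of the fence placement this PR uses (node and field names are illustrative, not the real LinkedTransferQueue internals): after a successful CAS that matches a waiting node, a fullFence() separates that store from the subsequent read of the waiter field, closing the Dekker window against the parking side that stores its waiter and then re-reads the item.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch, not j.u.c source code.
class MatchSketch<E> {
    static final class Node<E> {
        final AtomicReference<E> item = new AtomicReference<>();
        volatile Thread waiter;   // set by the parking side before it parks
    }

    /** Matching side: CAS the item, fence, then read the waiter to unpark. */
    Thread match(Node<E> p, E expected, E replacement) {
        if (p.item.compareAndSet(expected, replacement)) {
            VarHandle.fullFence();   // StoreLoad: ensure the waiter load is not
                                     // satisfied ahead of the CAS becoming visible
            return p.waiter;         // thread to unpark (null if not yet parked)
        }
        return null;                 // lost the race; caller retries
    }
}
```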

Notes: This modifies the internal invariants of the existing lock-free algorithms. The fences are correct (ARM64 doesn't provide StoreLoad from STLR→LDAR to different addresses), but they do add overhead on every signal/park cycle.

Tests unblocked: SynchronousQueue, LinkedTransferQueue, virtual thread parking/unparking races, JSR166TestCase failures.


PR 6: TestTrampoline Windows fix

Note:

  • this fix will be upstreamed before the rest of these commits; it's included here to keep the GitHub Actions (GHA) runs green

Files:

  • test/hotspot/jtreg/compiler/c2/aarch64/TestTrampoline.java — warm up UTF-16 String.charAt path on Windows before trampoline test

Rationale: On Windows, the JNU encoding may cause String.charAt UTF-16 path to not get hot enough for C2 inlining, changing trampoline generation behavior. Standalone test fix.
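
A hedged sketch of the kind of warm-up such a fix adds (loop counts, string contents, and the class name are illustrative, not the actual test code):

```java
// Hypothetical warm-up: exercise charAt on a UTF-16-backed String enough
// times that C2 compiles and inlines the UTF-16 path before the part of the
// test that inspects trampoline generation.
class CharAtWarmup {
    static long warmUp() {
        String utf16 = "pr\u00fcfung-\u00e9l\u00e8ve";  // non-Latin-1 chars force the UTF-16 coder
        long sink = 0;
        for (int i = 0; i < 20_000; i++)
            for (int j = 0; j < utf16.length(); j++)
                sink += utf16.charAt(j);                // keeps the UTF-16 charAt path hot
        return sink;                                    // returned so the loop isn't dead code
    }
}
```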


Suggested Integration Order

| Order | PR | Risk | Scope |
| --- | --- | --- | --- |
| 1 | MSVC /volatile:iso fix (build + platform atomics) | Low — Windows AArch64 only | os_cpu, make |
| 2 | ObjectMonitor try_spin() StoreLoad | Low — all ARM64 | runtime |
| 3 | currentCarrierThread intrinsic acquire + @DontInline | Medium — C1/C2/shared | compiler, java.base |
| 4 | FJP carrier starvation compensation | Medium — runtime + java.base | runtime, j.u.c |
| 5 | Dekker fences in j.u.c + VirtualThread | High reviewability concern — touches Doug Lea's code | java.base |
| 6 | TestTrampoline Windows fix | Trivial | test |

PR 1 should go first as the foundation. PRs 2 and 6 are independent and can go in parallel. PR 3 is independent of 2/4. PR 4 depends on 2 (for the includes). PR 5 is the most contentious (touching j.u.c. core) and benefits from having all other fixes in first to narrow down remaining test failures.

…dering

On ARM64's weakly-ordered memory model, pointer loads (LoadP/LoadN)
can be reordered with subsequent dependent loads by the hardware. This
causes failures in concurrent data structures used by virtual thread
scheduling (SynchronousQueue, LinkedTransferQueue) where a loaded
pointer is immediately dereferenced but the dependent field load sees
stale data.

Force MemNode::acquire at IR creation time in LoadNode::make() for all
LoadP/LoadN nodes on AARCH64 when barrier_data == 0. This must happen
at IR creation rather than instruction selection because C2's optimizer
can eliminate or reorder loads between those phases — a load removed by
LoadNode::Identity() or hoisted by Ideal() will never reach the matcher
to receive a barrier.

Loads with barrier_data != 0 (GC reference barriers) are excluded since
GC barrier patterns emit their own ordering sequences and have no
acquire variants in the .ad file.

Also reverts needs_acquiring_load() to the upstream form (is_acquire()
only), since the opcode-based check is now redundant — all pointer
loads already have acquire set on the IR node.
…atching

Lower the cost of loadP_acq and loadN_acq from (4*INSN_COST + VOLATILE_REF_COST)
to VOLATILE_REF_COST so the instruction matcher prefers these patterns over the
loadP_volatile/loadN_volatile + addP_reg_imm combination when the address has a
foldable constant offset.

The acq patterns use ldr+dmb with memory8/memory4 operands that support offset
addressing, while the volatile patterns use ldar with an indirect operand (no
offset). At the previous higher cost, the matcher would choose the cheaper
ldar variant plus a separate addP_reg_imm instruction rather than folding the
offset into a single ldr+dmb. This left unexpected addP_reg_imm nodes in the
compiled output, causing TestCombineAddPWithConstantOffsets to fail on AArch64
fastdebug builds.
MSVC's /volatile:iso means volatile reads/writes have no implicit
acquire/release on ARM64, unlike GCC/Clang. This causes memory ordering
violations in the interpreter, C1, and C2 JIT tiers.

Changes:

1. orderAccess_windows_aarch64.hpp: Use __dmb() intrinsics instead of
   std::atomic_thread_fence() for READ_MEM_BARRIER (dmb ishld),
   WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish).

2. atomic_windows_aarch64.hpp: Implement PlatformOrderedLoad (__ldar)
   and PlatformOrderedStore (__stlr) for 1/2/4/8-byte widths, ensuring
   acquire/release semantics for Atomic::load_acquire/store_release.

3. barrierSetAssembler_aarch64.cpp: Add dmb ishld after interpreter
   oop reference loads to enforce load ordering.

4. barrierSetC1.cpp: Add membar_acquire() after C1 oop reference loads.

5. memnode.cpp: Add MemNode::acquire to C2 LoadP/LoadN for T_OBJECT
   and T_NARROWOOP when barrier_data == 0. T_ADDRESS is intentionally
   excluded to avoid breaking C2 TLAB inline allocation ordering.
Add /volatile:ms to JVM_CFLAGS for Windows AArch64 in build config
Override PlatformLoad/PlatformStore with LDAR/STLR intrinsics as defense-in-depth
Add StoreLoad barrier in ObjectMonitor::try_spin() Dekker protocol
Use acquire loads and opaque base update in ForkJoinPool work-stealing
@swesonga
Member

@macarte, which docs or sources are you looking at that show that "GCC/Clang on AArch64 default to /volatile:ms"?

@macarte
Contributor Author

macarte commented Feb 27, 2026

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

# and GCC/Clang AArch64). Use /volatile:ms to restore those semantics and
# prevent memory ordering bugs in ObjectMonitor, ParkEvent, and other
# lock-free algorithms that use plain volatile fields.
$1_CFLAGS_CPU_JVM="/volatile:ms"
Member


Consider using the same style for these flags as the rest of the make files: -volatile:ms

Contributor Author


agreed

@swesonga
Member

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

sorry, I was asking about a reference for GCC/Clang, which the PR mentions

@swesonga
Member

out of curiosity, what was the cause of the earlier failure in the debug build? was it a syntax issue in the .ad file?


// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're fixing atomics handling on Windows AArch64, won't we need additional justification for why we're changing code that other platforms are using given that they haven't run into any of these concurrency issues?

Contributor Author

@macarte macarte Mar 6, 2026


The problem this solves is a compiler correctness issue, not a hardware barrier issue. If C1 inlines a @ChangesCurrentThread method (like Continuation.yield() or carrier-thread switching code) into a caller that isn't annotated @ChangesCurrentThread, the compiler can cache the result of Thread.currentCarrierThread() across the inlined call — observing the old carrier thread after the virtual thread has migrated to a new one.

This is broken on all architectures, not just ARM64:

  • On ARM64, it manifests more easily because the stale OopHandle dereference can also be reordered (which the MO_ACQUIRE fix in c1_LIRGenerator/library_call addresses separately).

  • On x86, it could still manifest if C1 keeps the currentCarrierThread result in a register across the inlined yield — TSO doesn't help because there's no memory access to reorder, the value is simply never re-read.

The check is also very low cost — it only runs during C1 inlining decisions, not at runtime. And @ChangesCurrentThread is used on a tiny number of methods, so this almost never triggers.

Bottom line: This is a correct-by-construction inlining guard that prevents a real compiler bug on all platforms. No architecture guard needed.

// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
&& !compilation()->method()->changes_current_thread()) {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific change, for example, wouldn't it be a bug for an inlined method that changes threads to not read the correct state given that all its registers are correct and the load/acquire semantics have been fixed?

Contributor Author

@macarte macarte Mar 6, 2026


When C1 inlines a @ChangesCurrentThread method into a non-@ChangesCurrentThread caller, the compiler can CSE (common subexpression elimination) or hoist the currentCarrierThread intrinsic across the carrier-change boundary. The result: the caller reuses a register-held value from before the yield/mount/unmount and never re-reads it.

The MO_ACQUIRE fix addresses a different problem — it ensures that when you do load currentCarrierThread, subsequent dependent loads (e.g., Thread.cont) see consistent data. But it can't force a re-load that the compiler optimized away entirely.

Consider this concrete scenario after inlining:

// Caller (not @ChangesCurrentThread):
Thread carrier = Thread.currentCarrierThread();  // C1 intrinsic → LDAR, result in R0
// ... inlined @ChangesCurrentThread code ...
//   yield() freezes frame (saves R0=old_carrier)
//   VM mounts VT on new carrier
//   thaw restores frame (R0 still = old_carrier)
// ... back in caller ...
carrier.cont  // uses stale R0 — wrong carrier!

The MO_ACQUIRE on the LDAR is irrelevant here because there is no second LDAR — C1 reused R0. And the "correct registers" after thaw are correct in the sense that they're faithfully restored from freeze — but they contain the pre-yield values, which is exactly the stale data.

The inlining guard ensures @ChangesCurrentThread methods stay as real calls, which forces C1 to treat currentCarrierThread as potentially invalidated across the call boundary and re-read it.

So: MO_ACQUIRE = correct ordering on loads that happen. Inlining guard = ensuring loads actually happen. They're complementary.
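
The "a fence can't resurrect a skipped read" point can be seen in a tiny single-threaded analogy (illustrative code, not compiler output):

```java
// Analogy for the CSE hazard described above: once a read has been cached in
// a local, no amount of fencing makes later uses observe the updated value —
// the re-read simply never happens.
class CachedReadAnalogy {
    static volatile int carrierId;

    static int usesCachedRead() {
        carrierId = 1;
        int cached = carrierId;   // the "CSE'd" read before the carrier change
        carrierId = 2;            // stands in for yield()/mount on a new carrier
        return cached;            // still 1 — stale, and unfixable by fences
    }

    static int reReads() {
        carrierId = 1;
        carrierId = 2;
        return carrierId;         // an actual re-read observes the new value: 2
    }
}
```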

q = p.next;
if (p.isData != haveData && haveData != (m != null) &&
p.cmpExItem(m, e) == m) {
// Full fence (StoreLoad) for Dekker with await() which
Contributor Author


VarHandle.fullFence() executes on all platforms, not just ARM64. On x86, it compiles to an MFENCE or lock addl $0, (%rsp) — which is redundant because x86 already provides StoreLoad ordering for its CAS instructions (they include a full barrier via the lock prefix). So x86 pays a small performance cost for a fence it doesn't need.

That said, the cost is low in practice:

  • This fence is only hit on the matching path (successful cmpExItem), which is the fast path that immediately unparks the waiter and exits. It's not in a tight spin loop.

  • On x86, processors often elide redundant fences when the store buffer is already drained (which it is after the lock cmpxchg from the CAS).

If we want to make this zero-cost on x86, we could guard it with an architecture check, but that would add complexity to Doug Lea's code for negligible gain. The comment already documents the ARM64-specific motivation clearly.

tl;dr: Functionally no-op on x86/x64 (CAS already has full-barrier semantics), but the instruction does get emitted and executed. The performance impact is negligible on the match path.

else if (p.cmpExItem(m, e) != m)
p = head; // missed; restart
else { // matched complementary node
// Full fence (StoreLoad) for Dekker with await() which

macarte and others added 3 commits March 5, 2026 17:30
With /volatile:ms set in JVM_CFLAGS for Windows AArch64, MSVC gives
volatile accesses acquire/release semantics (LDAR/STLR), making several
defense-in-depth changes unnecessary and harmful to performance:

- Revert orderAccess_windows_aarch64.hpp from __dmb() intrinsics back to
  std::atomic_thread_fence(). With /volatile:ms, volatile accesses are
  already compiler-barriered, so std::atomic_thread_fence() works
  correctly for both compiler and hardware ordering.

- Remove PlatformLoad<N> and PlatformStore<N> overrides from
  atomic_windows_aarch64.hpp. These forced LDAR/STLR on every
  Atomic::load()/Atomic::store() call, but /volatile:ms already
  produces identical codegen for the generic volatile-dereference path.
  The PlatformOrderedLoad/PlatformOrderedStore specializations (for
  Atomic::load_acquire/release_store) are retained as they provide
  single-instruction LDAR/STLR instead of the generic dmb+LDR/STR
  fallback.
…-additional-fixes

Remove redundant memory ordering overrides now that /volatile:ms is used
